* __ _ __ . o * .
/ /_(_)__/ /_ ___ _____ _______ ___
/ __/ / _ / // / |/ / -_) __(_-</ -_)
\__/_/\_,_/\_, /|___/\__/_/ /___/\__/
* . /___/ o . *
We are going to be working in the tidyverse for a good chunk of our time together. The whole point of the tidyverse is to offer a consistent grammar of data-manipulation verbs, and it is going to help us in many of the situations we will encounter.
Another great feature of the tidyverse is the pipe: %>%
It plays the same role as the Unix |, but | in R is already taken as the logical "or" operator.
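As a quick illustration, the pipe simply feeds the left-hand result into the next function, so nested calls can be read left to right (a minimal sketch using magrittr, which supplies %>% and is loaded with the tidyverse):

```r
library(magrittr)  # provides %>%; loaded automatically by the tidyverse

x <- c(1, 4, 9, 16)

# Nested base R: read from the inside out
sqrt(sum(x))

# Piped: read left to right, same result
x %>% sum() %>% sqrt()
```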
With all of the glowing praise for the tidyverse, we are still going to see some base R. Sometimes it will demonstrate great reasons for using the tidyverse; in other situations, it will help you feel comfortable reaching for base R when the need arises.
library(ggplot2)
plotDat = aggregate(diamonds$cut, by = list(cut = diamonds$cut),
FUN = length)
colnames(plotDat)[2] = "n"
plotDat
        cut     n
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
ggplot(plotDat, aes(x = cut, y = n)) +
geom_point(aes(size = n)) +
theme_minimal()

Look at help(mtcars) and check out the variables. Can you spot what is wrong with this plot?
ggplot(mtcars, aes(x = wt, y = mpg, color = am)) +
geom_point() +
theme_minimal()

The plot below is likely better.
library(dplyr)
mtcars$amFactor = as.factor(mtcars$am)
ggplot(mtcars, aes(x = wt, y = mpg, color = amFactor)) +
geom_point() +
theme_minimal()

Recall some of the things that we just saw:
plotDat = aggregate(diamonds$cut, by = list(cut = diamonds$cut), FUN = length)
colnames(plotDat)[2] = "n"
ggplot(plotDat, aes(x = cut, y = n)) +
geom_point(aes(size = n)) +
theme_minimal()

This is somewhat tricky code. We have to create a new object with the oft-muddy aggregate and reset a column name (by magic number in an index, no less).
This can be made much easier with dplyr:
diamonds %>%
group_by(cut) %>%
summarize(n = n()) %>%
ggplot(., aes(x = cut, y = n)) +
geom_point(aes(size = n)) +
theme_minimal()

It isn’t a reduction in lines, but it is certainly clearer and follows a more logical thought process. This is the whole point of the tidyverse (and dplyr specifically) – allowing you to write how you would explain the process.
As an added bonus, we don’t need to create a bunch of different objects to do something simple.
We can see that dplyr will also make the plot for am easier.
mtcars %>%
mutate(am = as.factor(am)) %>%
ggplot(., aes(x = wt, y = mpg, color = am)) +
geom_point() +
theme_minimal()

You will often notice that a dplyr chunk takes a few more lines to work through than base R alone – don’t consider this a bad thing. There will be many times in this course, and in your work, when you will think that you need to use as few lines as possible. Resist this temptation. Sometimes you need to break something up into many lines and create new objects – this flexibility is exactly why we use R!
Importing data is often the easiest part (never too hard to import a nice .csv). Sometimes, though, we need some other strategies.
Frequently, you will see nicely delimited text files that are not .csv files – these are often tab-delimited files, but they can take other forms.
read.table("https://download.bls.gov/pub/time.series/ce/ce.data.42a.RetailTrade.Employment",
header = TRUE, sep = "\t")

Is the same as:
read.delim("https://download.bls.gov/pub/time.series/ce/ce.data.42a.RetailTrade.Employment")

The read.table() function gives you added flexibility to specify many different parameters.
Examine the following file from SDC Platinum and read it in properly:
How did you do?
Did you notice anything about these files? They are not really very big, but they might have taken a little bit of time to read in. People sometimes comment that R is too slow on the read side. If you find your files are not being read quickly enough, you can try a few alternatives: readr and data.table.
Try the following:
library(readr)
readrTest = read_delim("https://download.bls.gov/pub/time.series/ce/ce.data.42a.RetailTrade.Employment",
delim = "\t")

library(data.table)
dtTest = fread("https://download.bls.gov/pub/time.series/ce/ce.data.42a.RetailTrade.Employment",
sep = "\t")

That SDC file that might have taken a few minutes will now take just a few seconds:
sdc = read_delim("https://www3.nd.edu/~sberry5/data/sdcTest.txt",
delim = "^")

Pretty awesome, right?
While readr works wonderfully on the read and write side, data.table is great for wrangling data that is a bit on the big side and is altogether blazing fast. However, it does not shy away from confusing syntax and odd conventions. With that in mind, we won’t be using it in this class, but do keep it in the back of your mind.
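To give you a small taste of those conventions, a minimal sketch of data.table’s general dt[i, j, by] form (the column names here come from mtcars; the result name meanWT is made up for illustration):

```r
library(data.table)

dt <- as.data.table(mtcars)

# dt[i, j, by]: filter rows in i, compute in j, group with by
dt[mpg > 20, .(meanWT = mean(wt)), by = cyl]
```

Compact, and very fast – but you can see how the bracket does three jobs at once, which takes some getting used to.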
At times, you will get data in some proprietary format. That is when you need to turn to other places.
Download the following Excel file: https://www3.nd.edu/~sberry5/data/excelTest.xlsx
readxl::read_excel(path = "")

What do we know about Excel workbooks? Check out the help on readxl and let me know our path forward.
haven::read_sas(data_file = "https://www3.nd.edu/~sberry5/data/wciklink_gvkey.sas7bdat")

haven::read_dta(file = "https://www3.nd.edu/~sberry5/data/stataExample.dta")

We often see -99 used as the missing value in SPSS (of course, there is no way that -99 would ever be an actual value, right?).
haven::read_spss(file = "https://www3.nd.edu/~sberry5/data/spssExample.sav",
user_na = "-99")

Depending on your needs, reading an html table into R is getting to be too easy.
library(rvest)
cpi = read_html("http://www.usinflationcalculator.com/inflation/consumer-price-index-and-annual-percent-changes-from-1913-to-2008/") %>%
html_table(fill = TRUE)

Things might get a bit tricky:
highest = read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films") %>%
html_table(fill = TRUE)

What is the return of this call?
For many of these tasks, you can just use the rio package – you give it the file and it will do the rest!
rio::import("folder/file")

Web-based graphics started getting popular not too long ago. Generally, stats people were not using them, but web developer-type folks were. They needed a structure that would work well for the web and interact with their JavaScript-based graphics – thus, JavaScript Object Notation (JSON) was born. You will see JSON come out of many web-based interfaces.
This is what JSON looks like.
There are a few JSON-reading packages in R, but jsonlite tends to work pretty well.
jsonTest = jsonlite::read_json(path = "https://www3.nd.edu/~sberry5/data/optionsDataBrief.json",
simplifyVector = TRUE)

This is a very simple form of JSON. We are going to see a hairier version of this data in the coming days.
There is JSON and then there is JSON. You might find yourself some interesting data and want to bring it in, but an error happens and you have no idea why the read_json function is telling you that the file is not JSON.
Not all JSON is pure JSON! When that is the case, you will need to create pure JSON.
Look at this file: https://www3.nd.edu/~sberry5/data/reviews_Musical_Instruments_5.json
It looks like JSON, but…
jsonlite::validate("https://www3.nd.edu/~sberry5/data/reviews_Musical_Instruments_5.json")

If we want to read that in as true JSON, we need to do some work:
musicalInstruments = readLines("https://www3.nd.edu/~sberry5/data/reviews_Musical_Instruments_5.json")
musicalInstruments = paste(unlist(lapply(musicalInstruments, function(x) {
paste(x, ",", sep = "")
})), collapse = "")
musicalInstruments = paste("[", musicalInstruments, "]", sep = "")
musicalInstruments = gsub("},]", "}]", musicalInstruments)

Everything we just learned is great, and you will use it all in your data wrangling missions.
Fortunately (or unfortunately, depending on how you look at it), it is not the whole story – you will frequently be reading in many files of the same type.
If you have two files, you might be able to get away with brute force:
# DO NOT RUN:
myData1 = read.csv("test.csv")
myData2 = read.csv("test2.csv")

Would you want to do this for 5 files? What about 100? Or 1000? I will answer for you: no!
The chunks below introduce some very important functions. We are going to see lapply again – it is important that you learn to love the apply family!
# DO NOT RUN:
allFiles = list.files(path = "", all.files = TRUE, full.names = TRUE,
recursive = TRUE, include.dirs = FALSE)
allFilesRead = lapply(allFiles, function(x) read.csv(x, stringsAsFactors = FALSE))
allData = do.call("rbind", allFilesRead)

You can also use rio:
# DO NOT RUN:
rio::import_list("", rbind = TRUE)

One of the major aims of the tidyverse is to provide a clear and consistent grammar for data manipulation. This is helpful when diving deeper into the weeds.
Do you remember this?
highest = read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films") %>%
html_table(fill = TRUE)

What did we get out of this? It was a big list of data frames. If we are looking for only one thing and we know that it is the first thing, we have some options:
highest = highest[[1]]

This is great for keeping the object at first and then plucking out what we want. If you want the whole thing to stay together, though, we have even more options:
highest = read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films") %>%
html_table(fill = TRUE) %>%
`[[`(1)

And now we see why R mystifies people. What is that bit of nonsense at the end? It is really just an index shortcut. Once you know how to use it, it is great; however, it will make you shake your head if you see it in the wild without knowing about it first.
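You can see the shortcut at work with any list (a small base R example; the list and its element names are made up for illustration):

```r
lst <- list(first = 1:3, second = letters)

# These two are identical -- `[[` is just the extraction function
# called in ordinary (prefix) form, which is what lets it sit in a pipe
lst[[1]]
`[[`(lst, 1)
```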
This is where the benefit of tidyverse becomes clear.
highest = read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films") %>%
html_table(fill = TRUE) %>%
magrittr::extract2(1)

Or…
highest = read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films") %>%
html_table(fill = TRUE) %>%
purrr::pluck(1)

Both functions do the same thing under slightly different names, but it is crystal clear what they are doing.
Do be careful, though, because we can run into issues with function masking – pluck from purrr does something very different from pluck from dplyr.
Someone try it and tell me what happens!
There are many ways to select variables with base R:
mtcars[, c(1:5, 7:8)]
keepers = c("mpg", "cyl", "disp", "hp", "drat", "qsec", "vs")
mtcars[, keepers]
mtcars[, c("mpg", grep("^c", names(mtcars), value = TRUE))]

You can also drop variables:
mtcars[, -c(1:2)]
 disp hp drat wt qsec vs am gear carb amFactor
Mazda RX4 160.0 110 3.90 2.620 16.46 0 1 4 4 1
Mazda RX4 Wag 160.0 110 3.90 2.875 17.02 0 1 4 4 1
Datsun 710 108.0 93 3.85 2.320 18.61 1 1 4 1 1
Hornet 4 Drive 258.0 110 3.08 3.215 19.44 1 0 3 1 0
Hornet Sportabout 360.0 175 3.15 3.440 17.02 0 0 3 2 0
Valiant 225.0 105 2.76 3.460 20.22 1 0 3 1 0
Duster 360 360.0 245 3.21 3.570 15.84 0 0 3 4 0
Merc 240D 146.7 62 3.69 3.190 20.00 1 0 4 2 0
Merc 230 140.8 95 3.92 3.150 22.90 1 0 4 2 0
Merc 280 167.6 123 3.92 3.440 18.30 1 0 4 4 0
Merc 280C 167.6 123 3.92 3.440 18.90 1 0 4 4 0
Merc 450SE 275.8 180 3.07 4.070 17.40 0 0 3 3 0
Merc 450SL 275.8 180 3.07 3.730 17.60 0 0 3 3 0
Merc 450SLC 275.8 180 3.07 3.780 18.00 0 0 3 3 0
Cadillac Fleetwood 472.0 205 2.93 5.250 17.98 0 0 3 4 0
Lincoln Continental 460.0 215 3.00 5.424 17.82 0 0 3 4 0
Chrysler Imperial 440.0 230 3.23 5.345 17.42 0 0 3 4 0
Fiat 128 78.7 66 4.08 2.200 19.47 1 1 4 1 1
Honda Civic 75.7 52 4.93 1.615 18.52 1 1 4 2 1
Toyota Corolla 71.1 65 4.22 1.835 19.90 1 1 4 1 1
Toyota Corona 120.1 97 3.70 2.465 20.01 1 0 3 1 0
Dodge Challenger 318.0 150 2.76 3.520 16.87 0 0 3 2 0
AMC Javelin 304.0 150 3.15 3.435 17.30 0 0 3 2 0
Camaro Z28 350.0 245 3.73 3.840 15.41 0 0 3 4 0
Pontiac Firebird 400.0 175 3.08 3.845 17.05 0 0 3 2 0
Fiat X1-9 79.0 66 4.08 1.935 18.90 1 1 4 1 1
Porsche 914-2 120.3 91 4.43 2.140 16.70 0 1 5 2 1
Lotus Europa 95.1 113 3.77 1.513 16.90 1 1 5 2 1
Ford Pantera L 351.0 264 4.22 3.170 14.50 0 1 5 4 1
Ferrari Dino 145.0 175 3.62 2.770 15.50 0 1 5 6 1
Maserati Bora 301.0 335 3.54 3.570 14.60 0 1 5 8 1
Volvo 142E 121.0 109 4.11 2.780 18.60 1 1 4 2 1
dropVars = c("vs", "drat")
mtcars[, !(names(mtcars) %in% dropVars)]
 mpg cyl disp hp wt qsec am gear carb amFactor
Mazda RX4 21.0 6 160.0 110 2.620 16.46 1 4 4 1
Mazda RX4 Wag 21.0 6 160.0 110 2.875 17.02 1 4 4 1
Datsun 710 22.8 4 108.0 93 2.320 18.61 1 4 1 1
Hornet 4 Drive 21.4 6 258.0 110 3.215 19.44 0 3 1 0
Hornet Sportabout 18.7 8 360.0 175 3.440 17.02 0 3 2 0
Valiant 18.1 6 225.0 105 3.460 20.22 0 3 1 0
Duster 360 14.3 8 360.0 245 3.570 15.84 0 3 4 0
Merc 240D 24.4 4 146.7 62 3.190 20.00 0 4 2 0
Merc 230 22.8 4 140.8 95 3.150 22.90 0 4 2 0
Merc 280 19.2 6 167.6 123 3.440 18.30 0 4 4 0
Merc 280C 17.8 6 167.6 123 3.440 18.90 0 4 4 0
Merc 450SE 16.4 8 275.8 180 4.070 17.40 0 3 3 0
Merc 450SL 17.3 8 275.8 180 3.730 17.60 0 3 3 0
Merc 450SLC 15.2 8 275.8 180 3.780 18.00 0 3 3 0
Cadillac Fleetwood 10.4 8 472.0 205 5.250 17.98 0 3 4 0
Lincoln Continental 10.4 8 460.0 215 5.424 17.82 0 3 4 0
Chrysler Imperial 14.7 8 440.0 230 5.345 17.42 0 3 4 0
Fiat 128 32.4 4 78.7 66 2.200 19.47 1 4 1 1
Honda Civic 30.4 4 75.7 52 1.615 18.52 1 4 2 1
Toyota Corolla 33.9 4 71.1 65 1.835 19.90 1 4 1 1
Toyota Corona 21.5 4 120.1 97 2.465 20.01 0 3 1 0
Dodge Challenger 15.5 8 318.0 150 3.520 16.87 0 3 2 0
AMC Javelin 15.2 8 304.0 150 3.435 17.30 0 3 2 0
Camaro Z28 13.3 8 350.0 245 3.840 15.41 0 3 4 0
Pontiac Firebird 19.2 8 400.0 175 3.845 17.05 0 3 2 0
Fiat X1-9 27.3 4 79.0 66 1.935 18.90 1 4 1 1
Porsche 914-2 26.0 4 120.3 91 2.140 16.70 1 5 2 1
Lotus Europa 30.4 4 95.1 113 1.513 16.90 1 5 2 1
Ford Pantera L 15.8 8 351.0 264 3.170 14.50 1 5 4 1
Ferrari Dino 19.7 6 145.0 175 2.770 15.50 1 5 6 1
Maserati Bora 15.0 8 301.0 335 3.570 14.60 1 5 8 1
Volvo 142E 21.4 4 121.0 109 2.780 18.60 1 4 2 1
Issues?
For starters, the magic numbers are a no-go. The keepers lines could work, but would be a pain if we had a lot of variables.
Let’s check this wacky stuff out where we want all variables that start with “age” and variables that likely represent questions (x1, x2, x3, …):
library(lavaan)
testData = HolzingerSwineford1939
names(testData)
 [1] "id" "sex" "ageyr" "agemo" "school" "grade" "x1"
[8] "x2" "x3" "x4" "x5" "x6" "x7" "x8"
[15] "x9"
keepers = c(grep("^age", names(testData), value = TRUE),
paste("x", 1:9, sep = ""))
testData = testData[, keepers]

Not only do we have another regular expression, but we also have this paste line to create variable names. It seems like too much work to do something simple!
While not beautiful, these are perfectly valid ways to do this work. I have such sights to show you, but don’t forget about this stuff – you never know when you might need to use it.
We have already seen a bit of dplyr, but we are going to dive right into some of the functions now.
In base R, we have to do some chanting to select our variables. With dplyr, we can just use select:
mtcars %>%
select(mpg, cyl, am)
 mpg cyl am
Mazda RX4 21.0 6 1
Mazda RX4 Wag 21.0 6 1
Datsun 710 22.8 4 1
Hornet 4 Drive 21.4 6 0
Hornet Sportabout 18.7 8 0
Valiant 18.1 6 0
Duster 360 14.3 8 0
Merc 240D 24.4 4 0
Merc 230 22.8 4 0
Merc 280 19.2 6 0
Merc 280C 17.8 6 0
Merc 450SE 16.4 8 0
Merc 450SL 17.3 8 0
Merc 450SLC 15.2 8 0
Cadillac Fleetwood 10.4 8 0
Lincoln Continental 10.4 8 0
Chrysler Imperial 14.7 8 0
Fiat 128 32.4 4 1
Honda Civic 30.4 4 1
Toyota Corolla 33.9 4 1
Toyota Corona 21.5 4 0
Dodge Challenger 15.5 8 0
AMC Javelin 15.2 8 0
Camaro Z28 13.3 8 0
Pontiac Firebird 19.2 8 0
Fiat X1-9 27.3 4 1
Porsche 914-2 26.0 4 1
Lotus Europa 30.4 4 1
Ford Pantera L 15.8 8 1
Ferrari Dino 19.7 6 1
Maserati Bora 15.0 8 1
Volvo 142E 21.4 4 1
We can also drop variables with the -:
mtcars %>%
select(-vs)
 mpg cyl disp hp drat wt qsec am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 4 2
amFactor
Mazda RX4 1
Mazda RX4 Wag 1
Datsun 710 1
Hornet 4 Drive 0
Hornet Sportabout 0
Valiant 0
Duster 360 0
Merc 240D 0
Merc 230 0
Merc 280 0
Merc 280C 0
Merc 450SE 0
Merc 450SL 0
Merc 450SLC 0
Cadillac Fleetwood 0
Lincoln Continental 0
Chrysler Imperial 0
Fiat 128 1
Honda Civic 1
Toyota Corolla 1
Toyota Corona 0
Dodge Challenger 0
AMC Javelin 0
Camaro Z28 0
Pontiac Firebird 0
Fiat X1-9 1
Porsche 914-2 1
Lotus Europa 1
Ford Pantera L 1
Ferrari Dino 1
Maserati Bora 1
Volvo 142E 1
We also have several helper functions that we can use:
HolzingerSwineford1939 %>%
select(num_range("x", 1:9), starts_with("age"),
matches("^s.*.l$"))

Changing variable position in R is a pain:
head(HolzingerSwineford1939[, c(1, 7:15, 2:6)])
 id x1 x2 x3 x4 x5 x6 x7 x8 x9
1 1 3.333333 7.75 0.375 2.333333 5.75 1.2857143 3.391304 5.75 6.361111
2 2 5.333333 5.25 2.125 1.666667 3.00 1.2857143 3.782609 6.25 7.916667
3 3 4.500000 5.25 1.875 1.000000 1.75 0.4285714 3.260870 3.90 4.416667
4 4 5.333333 7.75 3.000 2.666667 4.50 2.4285714 3.000000 5.30 4.861111
5 5 4.833333 4.75 0.875 2.666667 4.00 2.5714286 3.695652 6.30 5.916667
6 6 5.333333 5.00 2.250 1.000000 3.00 0.8571429 4.347826 6.65 7.500000
sex ageyr agemo school grade
1 1 13 1 Pasteur 7
2 2 13 7 Pasteur 7
3 2 13 1 Pasteur 7
4 1 13 2 Pasteur 7
5 2 12 2 Pasteur 7
6 2 14 1 Pasteur 7
HolzingerSwineford1939 %>%
select(id, starts_with("x"), everything()) %>%
head()
 id x1 x2 x3 x4 x5 x6 x7 x8 x9
1 1 3.333333 7.75 0.375 2.333333 5.75 1.2857143 3.391304 5.75 6.361111
2 2 5.333333 5.25 2.125 1.666667 3.00 1.2857143 3.782609 6.25 7.916667
3 3 4.500000 5.25 1.875 1.000000 1.75 0.4285714 3.260870 3.90 4.416667
4 4 5.333333 7.75 3.000 2.666667 4.50 2.4285714 3.000000 5.30 4.861111
5 5 4.833333 4.75 0.875 2.666667 4.00 2.5714286 3.695652 6.30 5.916667
6 6 5.333333 5.00 2.250 1.000000 3.00 0.8571429 4.347826 6.65 7.500000
sex ageyr agemo school grade
1 1 13 1 Pasteur 7
2 2 13 7 Pasteur 7
3 2 13 1 Pasteur 7
4 1 13 2 Pasteur 7
5 2 12 2 Pasteur 7
6 2 14 1 Pasteur 7
Use that Stata test file.
Grab every lvi, effect, leader, and cred variable
Use summary to understand your data.
Now, just keep every lvi variable.
Use a corrplot to see relationships.
# Just to give you an idea about how it works!
install.packages("corrplot")
data.frame(x = rnorm(10), y = rnorm(10)) %>%
cor() %>%
corrplot()

One of the more frequent tasks is related to filtering/subsetting your data. You often want to impose some types of rules on your data (e.g., US only, date ranges).
R gives us all the ability in the world to filter data.
summary(mtcars[mtcars$mpg < mean(mtcars$mpg), ])
 mpg cyl disp hp
Min. :10.40 Min. :6.000 Min. :145.0 Min. :105.0
1st Qu.:14.78 1st Qu.:8.000 1st Qu.:275.8 1st Qu.:156.2
Median :15.65 Median :8.000 Median :311.0 Median :180.0
Mean :15.90 Mean :7.556 Mean :313.8 Mean :191.9
3rd Qu.:18.02 3rd Qu.:8.000 3rd Qu.:360.0 3rd Qu.:226.2
Max. :19.70 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :2.770 Min. :14.50 Min. :0.0000
1st Qu.:3.070 1st Qu.:3.440 1st Qu.:16.10 1st Qu.:0.0000
Median :3.150 Median :3.570 Median :17.35 Median :0.0000
Mean :3.302 Mean :3.839 Mean :17.10 Mean :0.1667
3rd Qu.:3.600 3rd Qu.:3.844 3rd Qu.:17.94 3rd Qu.:0.0000
Max. :4.220 Max. :5.424 Max. :20.22 Max. :1.0000
am gear carb amFactor
Min. :0.0000 Min. :3.000 Min. :1.000 0:15
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.250 1: 3
Median :0.0000 Median :3.000 Median :4.000
Mean :0.1667 Mean :3.444 Mean :3.556
3rd Qu.:0.0000 3rd Qu.:3.750 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
Unless you know exactly what you are doing, this is a bit hard to read – you might be asking yourself what the comma means and why there is nothing after it.
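For reference, the comma in base R’s bracket indexing separates rows from columns – df[rows, columns] – and leaving the column slot empty keeps every column:

```r
# Rows where mpg is above 30, all columns kept
bigMPG <- mtcars[mtcars$mpg > 30, ]
nrow(bigMPG)

# Same rows, but only two columns
mtcars[mtcars$mpg > 30, c("mpg", "wt")]
```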
When we use filter, we are specifying what it is that we want to keep.
Keep this or that:
mtcars %>%
filter(cyl == 4 | cyl == 8) %>%
summary()
 mpg cyl disp hp
Min. :10.40 Min. :4.00 Min. : 71.1 Min. : 52.0
1st Qu.:15.20 1st Qu.:4.00 1st Qu.:120.1 1st Qu.: 93.0
Median :18.70 Median :8.00 Median :275.8 Median :150.0
Mean :20.19 Mean :6.24 Mean :244.0 Mean :153.5
3rd Qu.:24.40 3rd Qu.:8.00 3rd Qu.:351.0 3rd Qu.:205.0
Max. :33.90 Max. :8.00 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.76 Min. :1.513 Min. :14.50 Min. :0.0
1st Qu.:3.08 1st Qu.:2.320 1st Qu.:16.90 1st Qu.:0.0
Median :3.69 Median :3.435 Median :17.60 Median :0.0
Mean :3.60 Mean :3.245 Mean :17.81 Mean :0.4
3rd Qu.:4.08 3rd Qu.:3.780 3rd Qu.:18.61 3rd Qu.:1.0
Max. :4.93 Max. :5.424 Max. :22.90 Max. :1.0
am gear carb amFactor
Min. :0.0 Min. :3.00 Min. :1.00 0:15
1st Qu.:0.0 1st Qu.:3.00 1st Qu.:2.00 1:10
Median :0.0 Median :3.00 Median :2.00
Mean :0.4 Mean :3.64 Mean :2.64
3rd Qu.:1.0 3rd Qu.:4.00 3rd Qu.:4.00
Max. :1.0 Max. :5.00 Max. :8.00
Keep this and that:
mtcars %>%
filter(cyl == 4 & mpg > 25) %>%
summary()
 mpg cyl disp hp
Min. :26.00 Min. :4 Min. : 71.10 Min. : 52.00
1st Qu.:28.07 1st Qu.:4 1st Qu.: 76.45 1st Qu.: 65.25
Median :30.40 Median :4 Median : 78.85 Median : 66.00
Mean :30.07 Mean :4 Mean : 86.65 Mean : 75.50
3rd Qu.:31.90 3rd Qu.:4 3rd Qu.: 91.08 3rd Qu.: 84.75
Max. :33.90 Max. :4 Max. :120.30 Max. :113.00
drat wt qsec vs
Min. :3.770 Min. :1.513 Min. :16.70 Min. :0.0000
1st Qu.:4.080 1st Qu.:1.670 1st Qu.:17.30 1st Qu.:1.0000
Median :4.150 Median :1.885 Median :18.71 Median :1.0000
Mean :4.252 Mean :1.873 Mean :18.40 Mean :0.8333
3rd Qu.:4.378 3rd Qu.:2.089 3rd Qu.:19.33 3rd Qu.:1.0000
Max. :4.930 Max. :2.200 Max. :19.90 Max. :1.0000
am gear carb amFactor
Min. :1 Min. :4.000 Min. :1.0 0:0
1st Qu.:1 1st Qu.:4.000 1st Qu.:1.0 1:6
Median :1 Median :4.000 Median :1.5
Mean :1 Mean :4.333 Mean :1.5
3rd Qu.:1 3rd Qu.:4.750 3rd Qu.:2.0
Max. :1 Max. :5.000 Max. :2.0
Filter this out:
mtcars %>%
filter(cyl != 4) %>%
summary()
 mpg cyl disp hp
Min. :10.40 Min. :6.000 Min. :145.0 Min. :105.0
1st Qu.:15.00 1st Qu.:6.000 1st Qu.:225.0 1st Qu.:123.0
Median :16.40 Median :8.000 Median :301.0 Median :175.0
Mean :16.65 Mean :7.333 Mean :296.5 Mean :180.2
3rd Qu.:19.20 3rd Qu.:8.000 3rd Qu.:360.0 3rd Qu.:215.0
Max. :21.40 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :2.620 Min. :14.50 Min. :0.0000
1st Qu.:3.070 1st Qu.:3.435 1st Qu.:16.46 1st Qu.:0.0000
Median :3.150 Median :3.520 Median :17.30 Median :0.0000
Mean :3.348 Mean :3.705 Mean :17.17 Mean :0.1905
3rd Qu.:3.730 3rd Qu.:3.840 3rd Qu.:17.98 3rd Qu.:0.0000
Max. :4.220 Max. :5.424 Max. :20.22 Max. :1.0000
am gear carb amFactor
Min. :0.0000 Min. :3.000 Min. :1.000 0:16
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000 1: 5
Median :0.0000 Median :3.000 Median :4.000
Mean :0.2381 Mean :3.476 Mean :3.476
3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
Naturally, it can also take a function:
mtcars %>%
filter(mpg < mean(mpg)) %>%
summary()
 mpg cyl disp hp
Min. :10.40 Min. :6.000 Min. :145.0 Min. :105.0
1st Qu.:14.78 1st Qu.:8.000 1st Qu.:275.8 1st Qu.:156.2
Median :15.65 Median :8.000 Median :311.0 Median :180.0
Mean :15.90 Mean :7.556 Mean :313.8 Mean :191.9
3rd Qu.:18.02 3rd Qu.:8.000 3rd Qu.:360.0 3rd Qu.:226.2
Max. :19.70 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :2.770 Min. :14.50 Min. :0.0000
1st Qu.:3.070 1st Qu.:3.440 1st Qu.:16.10 1st Qu.:0.0000
Median :3.150 Median :3.570 Median :17.35 Median :0.0000
Mean :3.302 Mean :3.839 Mean :17.10 Mean :0.1667
3rd Qu.:3.600 3rd Qu.:3.844 3rd Qu.:17.94 3rd Qu.:0.0000
Max. :4.220 Max. :5.424 Max. :20.22 Max. :1.0000
am gear carb amFactor
Min. :0.0000 Min. :3.000 Min. :1.000 0:15
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.250 1: 3
Median :0.0000 Median :3.000 Median :4.000
Mean :0.1667 Mean :3.444 Mean :3.556
3rd Qu.:0.0000 3rd Qu.:3.750 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
For now, we are going to stick with that stataExample data.
Select the same variables, but also include Rater.
Filter the data on Rater – check the values and filter both ways.
Now check those correlations again!
Throw the Gender variable in and filter on that.
Adding a new variable in base R is as easy as the following:
mtcars$roundedMPG = round(mtcars$mpg)

If, however, we want to do things in a tidy chunk, we need to use mutate.
mtcars = mtcars %>%
mutate(roundedMPG = round(mpg))

There is also transmute. Can anyone venture a guess as to what it might do?
You will need to recode variables at some point. Depending on the nature of the recode, it can be easy (e.g., to reverse code a scale, you just subtract every value from the max value + 1).
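For example, reverse coding a hypothetical 5-point scale item (the responses vector is made up for illustration):

```r
responses <- c(1, 2, 5, 4, 3)

# On a 1-5 scale, subtract from 5 + 1 = 6 to flip the scale
reversed <- (5 + 1) - responses
reversed  # 5 4 1 2 3
```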
You will need to do some more elaborate stuff:
mtcars$mpgLoHi = 0
mtcars$mpgLoHi[mtcars$mpg > median(mtcars$mpg)] = 1

mtcars$mpgLoHi = ifelse(mtcars$mpg > median(mtcars$mpg), 1, 0)

These are pretty good ways to do recoding of this nature, but what about this:
mtcars$vs[which(mtcars$vs == 0)] = "v"
mtcars$vs[which(mtcars$vs == 1)] = "s"Or this:
mtcars$vs = ifelse(mtcars$vs == 0, "v", "s")

recode(mtcars$vs, `0` = "v", `1` = "s")

For the sake of demonstration, select only the first 10 lvi variables and everything else.
Keep only observations with Rater == 0.
Assume that the first 5 lvi variables (01 through 05) are scores for one assessment and the next five (06 through 10) are scores for another assessment.
Create two new variables to capture the mean of those scores.
You will need to use the rowwise function ahead of mutate.
You can use the mean function, but you will have to wrap the variables in c()
# Just to help you along!
data.frame(x = rnorm(10), y = rnorm(10)) %>%
rowwise() %>%
mutate(test = mean(c(x, y)))

We won’t have any big end-of-day wrap-up exercises today. Instead, we are going to learn just a few cool things.
We already saw some ggplot2, but let’s take a few minutes to dive into it a bit more.
Just like everything else in the tidyverse, ggplot2 provides a clear and consistent grammar, except the focus is on data visualization. With ggplot2, we can stack layer after layer into the plotting space to help visualize our data.
Let’s take a look at some good ggplot2 layering:
library(ggplot2)
library(lavaan)
testData = HolzingerSwineford1939
ggplot(testData, aes(x7, ageyr)) +
geom_point()

Next, we can add some color:
ggplot(testData, aes(x7, ageyr)) +
geom_point(aes(color = as.factor(grade)), alpha = .75)

Now, we can add a smooth line:
ggplot(testData, aes(x7, ageyr)) +
geom_point(aes(color = as.factor(grade)), alpha = .75) +
geom_smooth()

And we can look at small multiples:
ggplot(testData, aes(x7, ageyr)) +
geom_point(aes(color = as.factor(grade)), alpha = .75) +
geom_smooth() +
facet_grid(~ sex)

Let’s get those silly grey boxes out of there:
ggplot(testData, aes(x7, ageyr)) +
geom_point(aes(color = as.factor(grade)), alpha = .75) +
geom_smooth() +
facet_grid(~ sex) +
theme_minimal()

Perhaps add a better color scheme:
ggplot(testData, aes(x7, ageyr)) +
geom_point(aes(color = as.factor(grade)), alpha = .75) +
geom_smooth() +
facet_grid(~ sex) +
theme_minimal() +
scale_color_brewer(palette = "Dark2")

We could keep going forever and tweak anything that you could imagine (labels, ticks, etc.), but this should give you a pretty good idea about what you can do with regard to static plots.
Oh…but we don’t have to stick with just static plots. We can use the plotly package to make our ggplot object interactive.
library(plotly)
radPlot = ggplot(testData, aes(x7, ageyr)) +
geom_point(aes(color = as.factor(grade)), alpha = .75) +
geom_smooth() +
facet_grid(~ sex) +
theme_minimal() +
scale_color_brewer(palette = "Dark2")
ggplotly(radPlot)

You can also build plots directly with plotly, but we will save that for another day.
Learning to use ggplot2 will pay great dividends – there is absolutely nothing better for creating visualizations. There is even a whole group of packages that do nothing but add stuff into it.
Visualizations are great, and they often tell a better story than tables. Sometimes, though, you want to give people a glimpse of the data itself. The DT package lets you create interactive data tables (they are JavaScript data tables).
You could give people the entire data to explore:
library(DT)
datatable(testData)

You can also use the DT package to tidy your summaries into a nice data frame:
lm(x7 ~ ageyr + school, data = testData) %>%
broom::tidy() %>%
mutate_if(is.numeric, round, 4) %>%
datatable()

We don’t want to get too far ahead of ourselves here – we will see more places to use this tomorrow.
Do you have a moment to hear the good word of Donald Knuth? If you want to work in a reproducible fashion, R Markdown and knitr are here to help you out. The slides you saw earlier, and even the document you are seeing now, were all done with R Markdown. It is my hope that you will also use R Markdown for your presentations on Thursday.
Since we used the Stata stuff, let’s keep rolling with that. The Rater variable indicates whether the person is a supervisor (0) or a subordinate (3). Since this data comes from a bigger set, this coding might have made sense there – it makes no sense for the data at hand. I don’t believe you would ever do this, but someone wants the leader_age variable discretized into two groups – at or below the mean, and above the mean. The same goes for leader_tenure and leader_experience. In addition to these changes, someone is nervous about having both raterNum and leaderID available in the data; they are requesting that at least one of them be removed.
We have a few distinct issues to address within this data – what would you propose that we do?